You can use RStudio’s graphical user interface to import CSV data into R.
You can explain the concept of reproducibility.
You can use the nrow(), ncol() and dim() functions to get the dimensions of a dataset, and the summary() function to get a summary of the dataset’s variables.
You can use vis_dat(), inspect_num() and inspect_cat() to obtain visual summaries of a dataset.
You can inspect a numeric variable:
with the summary functions mean(), median(), max(), min(), length() and sum();
with the graphical functions hist() and boxplot().
You can inspect a categorical variable:
with the summary functions table() and janitor::tabyl();
with the graphical functions barplot() and pie().
With your newly acquired knowledge of functions and objects, you now have the basic building blocks required to do simple data analysis in R. So let’s get started. The goal is to start working with data as quickly as possible, even before you feel ready.
Here you will analyze a dataset of confirmed and suspected cases of Ebola hemorrhagic fever in Sierra Leone in May and June of 2014 (Fang et al., 2016). The data is shown below:
You will import and explore this dataset, then use R to answer the following questions about the outbreak:
First, open a new script in RStudio with
File > New File > R Script.
Next, save the script with File > Save As or press
Command/Control + S to bring up
the Save File dialog box. Save the file with the name “ebola_analysis”
or something similar.
Add a title, name and date to the start of the script, as code comments. This is generally good practice for writing R scripts. Your header may look like this:
# Ebola Sierra Leone analysis
# John Sample-Name Doe
# 2024-01-01
You should update the name and the date to your own name and the actual date you prepared the script. The name is helpful for other people looking at the script. Write the date in year-month-day order, as in the example above.
Next, use the p_load() function from {pacman} to load
the packages you will be using. Put this under a section header called
“Load packages”, with four hyphens, as shown below:
# Load packages ----
if(!require(pacman)) install.packages("pacman")
pacman::p_load(
tidyverse,
inspectdf,
plotly,
janitor,
visdat
)
Remember that the full signifier of a function includes both
the package name and the function name,
package::function(). This full signifier is handy if you
want to use a function before you have loaded its source package. This
is the case in the code chunk above: we want to use p_load()
from {pacman} without formally loading the {pacman} package, so we type
pacman::p_load().
We could also load {pacman} first, before using the p_load() function:
library(pacman) # first load {pacman}
p_load(tidyverse) # use `p_load` from {pacman} to load other packages
(Also recall that the benefit of p_load() is that it
automatically installs a package if it is not yet installed. Without
p_load(), you have to first install the package with
install.packages() before you can load it with
library().)
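The package::function() form works for any installed package, not only {pacman}. As a minimal sketch, the example below calls one function each from {stats} and {utils}, two packages that ship with R, through their full signifiers. The vector `toy_ages` is made up for illustration.

```r
# `toy_ages` is a made-up vector, used only for illustration
toy_ages <- c(6, 46, 25, 8, 49)

# Call functions through their full signifier, package::function().
# {stats} and {utils} ship with R, so nothing needs to be installed.
stats::median(toy_ages)
utils::head(toy_ages, n = 2)
```

In everyday code you would normally type median() and head() directly, because R attaches these two packages at startup.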
Now that the needed packages are loaded, you should import the dataset.
Go to bit.ly/ebola-data to view the dataset you will be working on. Then click the download icon at the top to download it to your computer.
You can leave the dataset in your downloads folder, or move it to somewhere more respectable; the upcoming steps will work independent of where the data is stored. In the next lesson, you will learn how to organize your data analysis projects properly, and we will think about the ideal folder setup for storing data.
NOTE: If you are using RStudio Cloud, you need to upload your dataset to the cloud. Do this in the “Files” tab by clicking on the “Upload” button.
Next, on the RStudio menu, go to
File > Import Dataset > From Text (base).
Browse through the computer’s files and navigate to the downloaded dataset. Click to open it. You should see an import dialog box like this:
Leave all the import settings at the default values. Just click on “Import” at the bottom, and voila, you should have the dataset loaded into R! You can tell this by looking at your environment pane, which should now feature an object called “ebola_sierra_leone” or something similar:
RStudio should also have called the View() function on
your dataset, so you should see a familiar spreadsheet view of this
data:
Now take a look at your console. Do you observe that your actions in
the graphical user interface actually triggered some R code to be run?
Copy the first part of this code, leaving out the >
symbol.
Paste the copied code into your R script, and label this section “Load data”.
In your script, you should also replace the read.csv()
function with read_csv(). read.csv() (with a
period) is from R’s preinstalled {utils} package for data importing,
while read_csv() (with an underscore) is a newer function
from the tidyverse, specifically the {readr} package. This latter
function, read_csv(), is faster and is better in other ways
you will see soon.
Now, you should have something like this in your script:
# Load data ----
ebola_sierra_leone <- read_csv("~/Downloads/ebola_sierra_leone.csv")
Nice work so far!
Your R script should look similar to this:
# Ebola Sierra Leone analysis
# John Sample-Name Doe
# 2024-01-01
# Load packages ----
if(!require(pacman)) install.packages("pacman")
pacman::p_load(
tidyverse,
inspectdf,
plotly,
janitor,
visdat
)
# Load data ----
ebola_sierra_leone <- read_csv("~/Downloads/ebola_sierra_leone.csv")
Now that the code for importing data is in your R script, you can easily rerun this script anytime to reimport the dataset; there will be no need to redo the manual point-and-click procedure for data import.
Try restarting R and rerunning the script now. Save your script with
Control/Command + S, then restart R
with the RStudio menu, at Session > Restart R. On
RStudio Cloud, the menu option looks like this:
If restarting is successful, your console should print this message:
You should also see the phrase “Environment is empty” in the Environment tab, indicating that the dataset you imported is no longer stored by R—you are starting with a fresh workspace.
To re-run your script, use Command/Control +
a to highlight all the code, then
Command/Control + Enter to run it.
If this worked, congratulations; you have the beginnings of your first “reproducible” analysis script!
What does “reproducible” mean?
When you do things with code rather than by pointing and clicking, it is easy for anyone to reproduce these steps by simply re-running your script.
While you can use RStudio’s graphical user interface to point-and-click your way through the data import process, you should always copy the relevant code to your script so that your script remains a reproducible record of all your analysis steps.
Of course, your script so far is not yet entirely reproducible, because the file path for the dataset (the one that looks like this: “…intro-to-data-analysis-with-r/ch01_getting_started/data…”) is specific to just your computer. Later on we will see how to use relative file paths, so that the code for importing data can work on anyone’s computer.
If your environment was not empty after restarting R, it means you skipped a step in a previous lesson. Do this now:
In the RStudio Menu, go to Tools > Global Options
to bring up RStudio’s options dialog box.
Then go to General > Basic, and
uncheck the box that says “Restore .RData into
workspace at startup”.
For the option, “save your workspace to .RData on exit”, set this to “Never”.
Now let’s walk through some basic steps of data exploration—taking a broad, bird’s eye look at the dataset. You should type up this section under a heading like “Explore data” in your script.
To view the top and bottom 6 rows of the dataset, you can use the
head() and tail() functions:
# Explore data ----
head(ebola_sierra_leone)
## # A tibble: 6 × 7
##      id   age sex   status    date_of_onset date_of_sample district
##   <dbl> <dbl> <chr> <chr>     <date>        <date>         <chr>
## 1    92     6 M     confirmed 2014-06-10    2014-06-15     Kailahun
## 2    51    46 F     confirmed 2014-05-30    2014-06-04     Kailahun
## 3   230    NA M     confirmed 2014-06-26    2014-06-30     Kenema
## 4   139    25 F     confirmed 2014-06-13    2014-06-18     Kailahun
## 5     8     8 F     confirmed 2014-05-22    2014-05-27     Kailahun
## 6   215    49 M     confirmed 2014-06-24    2014-06-29     Kailahun
tail(ebola_sierra_leone)
## # A tibble: 6 × 7
##      id   age sex   status    date_of_onset date_of_sample district
##   <dbl> <dbl> <chr> <chr>     <date>        <date>         <chr>
## 1   214     6 F     confirmed 2014-06-24    2014-06-30     Kenema
## 2    28    45 F     confirmed 2014-05-27    2014-06-01     Kailahun
## 3    12    27 F     confirmed 2014-05-22    2014-05-27     Kailahun
## 4   110     6 M     confirmed 2014-06-10    2014-06-15     Kailahun
## 5   209    40 F     confirmed 2014-06-24    2014-06-27     Kailahun
## 6    35    29 M     suspected 2014-05-28    2014-06-01     Kenema
To view the whole dataset, use the View() function.
View(ebola_sierra_leone)
This will again open a familiar spreadsheet view of the data:
Viewing just the top and bottom 6 rows of your dataset may seem
useless. But when you start working with very large datasets (thousands
to millions of rows), your computer will be overworked if you try to
View() the entire dataset at once. So taking small peeks is
often useful.
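Both head() and tail() also accept an `n` argument, for when six rows is not the number you want. The sketch below uses a small made-up data frame, `toy_cases`, rather than the real dataset.

```r
# `toy_cases` is a made-up data frame, used only for illustration
toy_cases <- data.frame(
  id = 1:10,
  age = c(6, 46, NA, 25, 8, 49, 13, 50, 35, 38)
)

head(toy_cases, n = 3)  # first 3 rows
tail(toy_cases, n = 2)  # last 2 rows
```

The same calls should work on ebola_sierra_leone, for example head(ebola_sierra_leone, n = 10).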
nrow(), ncol() and dim() give
you the dimensions of your dataset:
nrow(ebola_sierra_leone) # number of rows
## [1] 200
ncol(ebola_sierra_leone) # number of columns
## [1] 7
dim(ebola_sierra_leone) # number of rows and columns
## [1] 200 7
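To see how the three functions relate, note that dim() returns the number of rows followed by the number of columns. A minimal sketch on a made-up data frame (`toy_cases` is hypothetical, not the lesson's dataset):

```r
# `toy_cases` is a made-up data frame, used only for illustration
toy_cases <- data.frame(
  id = 1:4,
  age = c(6, 46, NA, 25),
  sex = c("M", "F", "M", "F")
)

dims <- dim(toy_cases)      # rows first, then columns
dims[1] == nrow(toy_cases)  # first element matches nrow()
dims[2] == ncol(toy_cases)  # second element matches ncol()
```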
If you’re not sure what a function does, remember that you can get
function help with the question mark symbol. For example, to get help on
the ncol() function, run:
?ncol
Another often-helpful function is summary():
summary(ebola_sierra_leone)
##        id              age            sex               status          date_of_onset
##  Min.   :  1.00   Min.   : 1.80   Length:200         Length:200         Min.   :2014-05-18
##  1st Qu.: 62.75   1st Qu.:20.00   Class :character   Class :character   1st Qu.:2014-06-01
##  Median :131.50   Median :35.00   Mode  :character   Mode  :character   Median :2014-06-13
##  Mean   :136.72   Mean   :33.85                                         Mean   :2014-06-12
##  3rd Qu.:208.25   3rd Qu.:45.00                                         3rd Qu.:2014-06-23
##  Max.   :285.00   Max.   :80.00                                         Max.   :2014-06-29
##                   NA's   :4
##  date_of_sample         district
##  Min.   :2014-05-23   Length:200
##  1st Qu.:2014-06-07   Class :character
##  Median :2014-06-18   Mode  :character
##  Mean   :2014-06-17
##  3rd Qu.:2014-06-29
##  Max.   :2014-07-17
##
As you can see, for numeric columns in your dataset,
summary() gives you the minimum value, the maximum value,
the mean, median and the 1st and 3rd quartiles.
For character columns it gives you just the length of the column (the number of rows), the “class” and the “mode”. We will discuss what “class” and “mode” mean later.
vis_dat()
The vis_dat() function from the {visdat} package is a
wonderful way to quickly visualize the data types and the missing values
in a dataset. Try this now:
vis_dat(ebola_sierra_leone)
From this figure, you can quickly see the character, date and numeric data types, and you can note that age is missing for some cases.
If the variables date_of_onset and
date_of_sample don’t show up as “Date” types in your
vis_dat() output, it probably means you imported the
dataset with the read.csv() function (with a period)
instead of read_csv() (with an underscore). One benefit of
read_csv() from {readr} is that it auto-detects date
variables, telling R that they are dates. The standard
read.csv() function does not do this, so date variables may
appear as regular character
variables.
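To see the read.csv() behaviour for yourself without touching the real dataset, here is a minimal sketch that uses only base R. It writes a tiny made-up CSV to a temporary file, reads it with read.csv(), and then converts the date column by hand with as.Date(). The file contents and the name `toy_cases` are assumptions for illustration.

```r
# Write a tiny made-up CSV to a temporary file (illustration only)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,date_of_onset", "1,2014-06-10", "2,2014-05-30"), csv_path)

# read.csv() leaves the date column as plain text
toy_cases <- read.csv(csv_path, stringsAsFactors = FALSE)
class(toy_cases$date_of_onset)

# Convert the text to real dates by hand with as.Date()
toy_cases$date_of_onset <- as.Date(toy_cases$date_of_onset)
class(toy_cases$date_of_onset)
```

Once the column is a Date, functions such as min() and hist() treat it as a point in time rather than as text.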
inspect_cat() and inspect_num()
Next, inspect_cat() and inspect_num() from
the {inspectdf} package produce nice visual summaries of the
distribution of variables (columns) in the dataset.
If you run inspect_cat() on the data object, you get a
tabular summary of the categorical
variables in the dataset, with some information hidden in the
levels column (later you will learn how to extract this
information).
inspect_cat(ebola_sierra_leone)
## # A tibble: 5 × 5
##   col_name         cnt common     common_pcnt levels
##   <chr>          <int> <chr>            <dbl> <named list>
## 1 date_of_onset     39 2014-06-10        10   <tibble [39 × 3]>
## 2 date_of_sample    45 2014-06-15         9.5 <tibble [45 × 3]>
## 3 district           7 Kailahun          77.5 <tibble [7 × 3]>
## 4 sex                2 F                 57   <tibble [2 × 3]>
## 5 status             2 confirmed         91   <tibble [2 × 3]>
But the magic happens when you run show_plot() on the
result from inspect_cat():
# store the output of `inspect_cat()` in `cat_summary`
cat_summary <- inspect_cat(ebola_sierra_leone)
# call the `show_plot()` function on that summary.
show_plot(cat_summary)
You get a wonderful figure showing the distribution of all categorical and date variables!
From this plot, you can quickly tell that most cases are in Kailahun, and that there are more cases in women than in men (“F” stands for “female”).
One problem is that in this plot, the smaller categories are not
labelled. So, for example, we are not sure what value is represented by
the white section for “status” at the bottom right. To see labels on
these smaller categories, you can turn this into an interactive plot
with the ggplotly() function from the {plotly} package.
cat_summary_plot <- show_plot(cat_summary)
ggplotly(cat_summary_plot)
Wonderful! Now you can hover over each of the bars to see the proportion of each bar section. For example, you can now tell that 9% (0.090) of the cases have a suspected status:
You can obtain a similar plot for the numerical (continuous)
variables in the dataset with inspect_num(). Here, we show
all three steps in one go.
num_summary <- inspect_num(ebola_sierra_leone)
num_summary_plot <- show_plot(num_summary)
ggplotly(num_summary_plot)
This gives you an overview of the numerical columns, age
and id. (Of course, the distribution of the id
variable is not meaningful.)
You can tell that individuals aged 35 to 40 (mid-point 37.5) are the largest age group, making up 13.8% (0.1377…) of the cases in the dataset.
Now that you have a sense of what the entire dataset looks like, you can isolate and analyze single variables at a time—this is called univariate analysis.
Go ahead and create a new section in your script for this univariate analysis.
# Univariate analysis, numeric variables ----
Let’s start by analyzing the numeric age variable.
To extract a single variable/column from a dataset, use the dollar
sign operator, $:
ebola_sierra_leone$age # extract the age column in the dataset
## [1] 6.0 46.0 NA 25.0 8.0 49.0 13.0 50.0 35.0 38.0 60.0 18.0 10.0 14.0 50.0 35.0 43.0 17.0 3.0 60.0 38.0
## [22] 41.0 49.0 12.0 74.0 21.0 27.0 41.0 42.0 60.0 30.0 50.0 50.0 22.0 40.0 35.0 19.0 3.0 34.0 21.0 73.0 65.0
## [43] 30.0 70.0 12.0 15.0 42.0 60.0 14.0 40.0 33.0 43.0 45.0 14.0 14.0 40.0 35.0 30.0 17.0 39.0 20.0 8.0 40.0
## [64] 42.0 53.0 18.0 40.0 20.0 45.0 40.0 60.0 44.0 33.0 23.0 45.0 7.0 NA 35.0 36.0 42.0 35.0 25.0 30.0 30.0
## [85] 28.0 14.0 20.0 60.0 67.0 35.0 50.0 4.0 28.0 38.0 30.0 26.0 37.0 30.0 3.0 56.0 32.0 35.0 54.0 42.0 48.0
## [106] 11.0 1.8 63.0 55.0 20.0 62.0 62.0 42.0 65.0 29.0 20.0 33.0 30.0 35.0 NA 50.0 16.0 3.0 22.0 7.0 50.0
## [127] 17.0 40.0 21.0 9.0 27.0 52.0 50.0 25.0 10.0 30.0 32.0 38.0 30.0 50.0 26.0 35.0 3.0 50.0 60.0 40.0 34.0
## [148] 4.0 42.0 NA 54.0 18.0 45.0 30.0 35.0 35.0 16.0 26.0 23.0 45.0 45.0 45.0 38.0 45.0 35.0 30.0 60.0 5.0
## [169] 18.0 2.0 70.0 35.0 3.0 30.0 80.0 62.0 20.0 45.0 18.0 28.0 48.0 38.0 39.0 26.0 60.0 35.0 20.0 50.0 11.0
## [190] 36.0 29.0 57.0 35.0 26.0 6.0 45.0 27.0 6.0 40.0 29.0
This list of values is called a vector in R. A vector is a kind of data structure that has elements of one type. In this case, the type is “numeric”. We will formally introduce you to vectors and other data structures in a future chapter. In this lesson, you can take “vector” and “variable” to be synonyms.
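As a small sketch of what extraction gives you, the example below pulls a column out of a made-up data frame and checks that the result is a numeric vector. It also shows `[[ ]]`, a base R alternative to `$` that this lesson does not cover; it is included only as an aside. The data frame `toy_cases` is hypothetical.

```r
# `toy_cases` is a made-up data frame, used only for illustration
toy_cases <- data.frame(
  id = 1:4,
  age = c(6, 46, NA, 25)
)

age_by_dollar <- toy_cases$age       # extraction with $
age_by_name <- toy_cases[["age"]]    # the same column, extracted by name

identical(age_by_dollar, age_by_name)  # both give the same vector
is.numeric(age_by_dollar)              # the vector's type is numeric
length(age_by_dollar)                  # one element per row of the data frame
```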
To get the mean of these ages, you could run:
mean(ebola_sierra_leone$age)
## [1] NA
But it seems we have a problem. R says the mean is NA,
which means “not applicable” or “not available”. This is because there
are some missing values in the vector of ages. (Did you notice this when
you printed the vector?) By default, R cannot find the mean if there are
missing values. To ignore these values, use the argument
na.rm, setting it to T, or
TRUE:
mean(ebola_sierra_leone$age, na.rm = T)
## [1] 33.84592
Great! This need to remove the NAs before computing a
statistic applies to many functions. The median() function
for example, will also return NA by default if it is called
on a vector with any NAs:
median(ebola_sierra_leone$age) # does not work
## [1] NA
median(ebola_sierra_leone$age, na.rm = T) # works
## [1] 35
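If you would like to count the missing values before deciding to drop them, base R's is.na() function can help. It is not covered in this lesson, so treat the following as a side sketch on a made-up vector, `toy_ages`.

```r
# `toy_ages` is a made-up vector with one missing value (illustration only)
toy_ages <- c(6, 46, NA, 25)

is.na(toy_ages)        # TRUE wherever a value is missing
sum(is.na(toy_ages))   # number of missing values (each TRUE counts as 1)

mean(toy_ages)                # NA, because of the missing value
mean(toy_ages, na.rm = TRUE)  # mean of the three known ages
```

Writing TRUE in full instead of T is a little safer, because T is an ordinary object name that can be overwritten, while TRUE cannot.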
mean() and median() are just two of many R
functions that can be used to inspect a numerical variable. Let’s look
at some others.
But first, assign the age vector to a new object, so you don’t have
to keep typing ebola_sierra_leone$age each time.
age_vec <- ebola_sierra_leone$age # assign the vector to the object "age_vec"
Now run these functions on age_vec and observe their
outputs:
sd(age_vec, na.rm = T) # standard deviation
## [1] 17.26864
max(age_vec, na.rm = T) # maximum age
## [1] 80
min(age_vec, na.rm = T) # minimum age
## [1] 1.8
summary(age_vec) # min, max, mean, quartiles and NAs
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##    1.80   20.00   35.00   33.85   45.00   80.00       4
length(age_vec) # number of elements in the vector
## [1] 200
sum(age_vec, na.rm = T) # sum of all elements in the vector
## [1] 6633.8
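These outputs fit together. length() counts every element, including the 4 NAs, so the mean reported earlier is the sum divided by the 196 known ages, not by 200. The sketch below redoes that arithmetic with the numbers printed above; the object names are made up for illustration.

```r
# Numbers copied from the outputs printed in this lesson
total_age <- 6633.8   # sum(age_vec, na.rm = T)
n_elements <- 200     # length(age_vec), which counts the NAs too
n_missing <- 4        # NA's reported by summary(age_vec)

n_known <- n_elements - n_missing  # number of ages actually recorded
total_age / n_known                # agrees with the mean of about 33.85
```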
Now let’s create a graph to visualize the age variable. The two most
common graphics for inspecting the distribution of numerical variables
are histograms
(like the output of the inspect_num() function you saw
earlier) and boxplots.
R has built-in functions for these:
hist(age_vec)
boxplot(age_vec)
Nice and easy!
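hist() also takes optional arguments, such as `breaks` for the bin edges and `main` and `xlab` for the title and axis label. The following is a minimal sketch on a made-up vector, `toy_ages`; the 10-year bins are an arbitrary choice for illustration.

```r
# `toy_ages` is a made-up vector, used only for illustration
toy_ages <- c(6, 46, 25, 8, 49, 13, 50, 35, 38, 60)

# Compute the histogram with 10-year bins, without drawing it yet
age_hist <- hist(toy_ages, breaks = seq(0, 80, by = 10), plot = FALSE)
age_hist$counts  # number of ages falling in each bin

# Draw it with a title and an x-axis label
plot(age_hist, main = "Age distribution (toy data)", xlab = "Age (years)")
```

You can pass the same arguments directly, as in hist(age_vec, breaks = seq(0, 80, by = 10), main = "Age distribution").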
Base vs {ggplot2} graphics
Graphical functions like boxplot() and
hist() are part of R’s base graphics package. These
functions are quick and easy to use, but they do not offer a lot of
flexibility, and it is difficult to make beautiful plots with them. So
in this course, when we start looking at data visualization properly, we
will focus on the {ggplot2} package, which is the gold standard for
visualization in R. For example, the nice graphs from
visdat::vis_dat(), inspectdf::inspect_cat(),
and inspectdf::inspect_num() are built on top of
{ggplot2}.
Next, let’s look at a categorical variable, the districts of reported cases:
# Univariate analysis, categorical variables ----
ebola_sierra_leone$district
## [1] "Kailahun" "Kailahun" "Kenema" "Kailahun" "Kailahun" "Kailahun"
## [7] "Kailahun" "Kailahun" "Kenema" "Kailahun" "Kailahun" "Kailahun"
## [13] "Kailahun" "Kailahun" "Kailahun" "Kailahun" "Kailahun" "Kenema"
## [19] "Kono" "Kailahun" "Kailahun" "Kailahun" "Kenema" "Kailahun"
## [25] "Kailahun" "Kailahun" "Kailahun" "Kailahun" "Kenema" "Kenema"
## [31] "Kenema" "Kailahun" "Kailahun" "Bo" "Kailahun" "Kailahun"
## [37] "Kailahun" "Kenema" "Kenema" "Kenema" "Kailahun" "Kailahun"
## [43] "Kailahun" "Kailahun" "Kailahun" "Kailahun" "Western Urban" "Kailahun"
## [49] "Kailahun" "Kailahun" "Kailahun" "Kailahun" "Kailahun" "Kailahun"
## [55] "Kailahun" "Kailahun" "Kailahun" "Kailahun" "Kailahun" "Kailahun"
## [61] "Kailahun" "Kenema" "Western Urban" "Kambia" "Kailahun" "Kailahun"
## [67] "Kailahun" "Kailahun" "Kailahun" "Kailahun" "Kailahun" "Kailahun"
## [73] "Kenema" "Kailahun" "Kailahun" "Kenema" "Kailahun" "Kailahun"
## [79] "Kenema" "Kailahun" "Kailahun" "Kailahun" "Kailahun" "Kailahun"
## [85] "Kailahun" "Kailahun" "Kailahun" "Kailahun" "Kailahun" "Kenema"
## [91] "Kailahun" "Kailahun" "Kailahun" "Kono" "Port Loko" "Kenema"
## [97] "Kailahun" "Kailahun" "Kailahun" "Kailahun" "Kenema" "Kailahun"
## [103] "Kailahun" "Kenema" "Kailahun" "Kailahun" "Kailahun" "Kailahun"
## [109] "Kailahun" "Kailahun" "Kenema" "Western Urban" "Kailahun" "Kailahun"
## [115] "Kailahun" "Kailahun" "Kailahun" "Kailahun" "Kailahun" "Kailahun"
## [121] "Kailahun" "Kailahun" "Kenema" "Kailahun" "Kailahun" "Kenema"
## [127] "Kailahun" "Port Loko" "Kailahun" "Kailahun" "Kailahun" "Kailahun"
## [133] "Kailahun" "Kailahun" "Kailahun" "Kailahun" "Kailahun" "Kailahun"
## [139] "Kailahun" "Kailahun" "Kailahun" "Kailahun" "Kailahun" "Kenema"
## [145] "Kenema" "Kailahun" "Kenema" "Kailahun" "Kailahun" "Kailahun"
## [151] "Kailahun" "Kailahun" "Kenema" "Kailahun" "Kailahun" "Kenema"
## [157] "Kailahun" "Kenema" "Kailahun" "Kailahun" "Kenema" "Kailahun"
## [163] "Kailahun" "Kailahun" "Kailahun" "Bo" "Kailahun" "Kailahun"
## [169] "Kailahun" "Kailahun" "Kailahun" "Kailahun" "Kenema" "Kailahun"
## [175] "Kailahun" "Kenema" "Kailahun" "Kailahun" "Kailahun" "Kailahun"
## [181] "Kailahun" "Kailahun" "Kailahun" "Western Urban" "Kailahun" "Kailahun"
## [187] "Kenema" "Kailahun" "Kailahun" "Kailahun" "Kailahun" "Kailahun"
## [193] "Kailahun" "Kenema" "Kenema" "Kailahun" "Kailahun" "Kailahun"
## [199] "Kailahun" "Kenema"
Sorry for printing that very long vector!
Assign this vector to the object district_vec to
simplify your downstream code.
district_vec <- ebola_sierra_leone$district
You can use the table() function to create a frequency
table of a categorical variable:
table(district_vec)
## district_vec
##            Bo      Kailahun        Kambia        Kenema          Kono     Port Loko Western Urban
##             2           155             1            34             2             2             4
You can see that most cases are in Kailahun and Kenema.
table() is a nice “base” function. But there is a better
function for creating frequency tables, called tabyl(),
from the {janitor} package:
tabyl(district_vec)
##   district_vec   n percent
##             Bo   2   0.010
##       Kailahun 155   0.775
##         Kambia   1   0.005
##         Kenema  34   0.170
##           Kono   2   0.010
##      Port Loko   2   0.010
##  Western Urban   4   0.020
tabyl() gives you both the counts and the
proportions of each value, and has some other attractive features you
will see later.
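The percent column is simply each count divided by the total number of cases. The sketch below reproduces that arithmetic by hand from the counts printed above. It illustrates the calculation only and is not a description of how tabyl() works internally; the object names are made up.

```r
# Counts copied from the table() output above
district_counts <- c(
  "Bo" = 2, "Kailahun" = 155, "Kambia" = 1, "Kenema" = 34,
  "Kono" = 2, "Port Loko" = 2, "Western Urban" = 4
)

total_cases <- sum(district_counts)            # total number of cases
district_props <- district_counts / total_cases  # proportion per district
district_props
```

For example, Kailahun's 155 cases out of 200 give the proportion 0.775 shown in the tabyl() output.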
To visualize the district variable with base R, pass the output of
table() to the barplot() or pie()
functions:
district_table <- table(district_vec)
barplot(district_table)
pie(district_table)
Nice and simple! (As you can see, a pie chart is not such a great idea for these data.)
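A bar plot is often easier to read when the bars are ordered by size. One way to do this in base R is to pass the table through sort() first. This is a sketch on a made-up vector, `toy_districts`, and the `las = 2` setting is just one option for keeping long labels readable.

```r
# `toy_districts` is a made-up vector, used only for illustration
toy_districts <- rep(c("Bo", "Kailahun", "Kenema"), times = c(2, 6, 3))

toy_table <- table(toy_districts)
sorted_table <- sort(toy_table, decreasing = TRUE)  # largest count first

# las = 2 draws the axis labels perpendicular to the axes
barplot(sorted_table, las = 2)
```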
With the functions you have just learned, you have the tools to answer the questions about the Ebola outbreak that were listed at the top. Give it a go. Attempt these questions on your own, then look at the solutions below.
Solutions
min(ebola_sierra_leone$date_of_sample)
## [1] "2014-05-23"
We don’t have the date of report, but the first “date_of_sample” (when the Ebola test sample was taken from the patient) is May 23rd. We can use this as a proxy for the date of first report.
hist(ebola_sierra_leone$age)
The 30-40 age group had the largest number of cases.
median(ebola_sierra_leone$age, na.rm = T)
## [1] 35
The median age of cases was 35.
sex_table <- table(ebola_sierra_leone$sex)
pie(sex_table)
tabyl(ebola_sierra_leone$sex)
##  ebola_sierra_leone$sex   n percent
##                       F 114    0.57
##                       M  86    0.43
As seen in the figure, there were more cases in women. Specifically, 57% of cases were in women.
district_table <- table(ebola_sierra_leone$district)
barplot(district_table)
As seen in the figure, the Kailahun district had the majority of cases.
# for dates you have to set the `breaks` argument. See ?hist
hist(ebola_sierra_leone$date_of_onset, breaks = "days")
It is debatable; a precise trend is not clear.
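Daily counts can be noisy, so grouping the dates into weeks sometimes makes a trend easier to judge. One possible approach, sketched here on made-up dates, uses base R's cut() function to assign each date to a week and table() to count them. cut() is not covered in this lesson, and `onset_dates` is a hypothetical vector, not the real data.

```r
# `onset_dates` is a made-up vector of dates, used only for illustration
onset_dates <- as.Date(c(
  "2014-05-22", "2014-05-27", "2014-06-10",
  "2014-06-13", "2014-06-24", "2014-06-26"
))

onset_week <- cut(onset_dates, breaks = "week")  # assign each date to a week
weekly_counts <- table(onset_week)               # cases per week, empty weeks included
weekly_counts

barplot(weekly_counts, las = 2)  # las = 2 keeps the week labels readable
```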
Congratulations! You have now taken your first baby steps in analyzing data with R: you imported a dataset, explored its structure, performed basic univariate analysis on its numeric and categorical variables, and you were able to answer important questions about the outbreak based on this.
Of course, this was only a sneak peek of the data analysis process—a lot was left out. We skipped over the process of data cleaning and wrangling, for example, by giving you a very clean dataset, and we focused on very simple visualizations with R’s base graphics package. Soon, you will learn to use the {dplyr} package for cleaning and manipulating “wilder” datasets, and you will learn to use {ggplot2}, the gold standard for making custom visualizations in R.
Hopefully, though, this sneak peek has gotten you a bit excited about what you can do with R. The journey is only beginning! See you soon.
The following team members contributed to this lesson:
Some material in this lesson was adapted from the following sources: